In [1]:
%matplotlib inline
import pylab
import numpy as np
Let's say we have some model of some task and we want to know how well it's doing. Maybe we're about to change some parameter and want to know whether changing that parameter makes the model work better. But every time we run our model we get a slightly different result, due to random variation.
Instead of running the model once, we might run it 3 times and take the average. But what does that actually tell us? It tells us the average of those 3 runs. We don't care about those particular 3 runs; we care about the actual average: the value we'd get if we ran it an infinite number of times and averaged. But that would take too long.
The goal of this notebook is to deal with exactly this problem: what can you conclude about the underlying distribution, given only samples from that distribution?
Here are 3 samples from a normal distribution. Since we're drawing from a zero-mean normal distribution, we know that the actual mean (the one we'd get from an infinite number of samples) is 0.
In [2]:
rng = np.random.RandomState(seed=0)
data = rng.normal(size=3)
pylab.scatter(np.zeros_like(data), data, marker='x', s=100, c='k')
pylab.xticks([])
pylab.yticks([])
pylab.show()
For this simple example, where do you think the actual mean is on this graph? Note: this is just meant to be a typical example and I'm not doing anything tricky here like searching for a random number seed that gives horrible results (seed=0).
What do you think the chances are that the mean is above the largest value or below the smallest value?
In [3]:
rng = np.random.RandomState(seed=0)
data = rng.normal(size=3)
pylab.scatter(np.zeros_like(data), data, marker='x', s=100, c='k')
pylab.xticks([])
pylab.show()
Ouch. Even without trying to find an example where the sampling is horribly off, we have one. If I were working with a model whose average used to be zero, made a change, and got these results, I would have convinced myself that I'd made an improvement when I actually hadn't.
Let's see how common this is.
In [6]:
def test():
    # Draw 3 samples and report 1 if the true mean (0) falls outside the
    # sample range (i.e. all the samples land on one side of it), else 0.
    data = rng.normal(size=3)
    if np.min(data) > 0 or np.max(data) < 0:
        return 1
    else:
        return 0

results = [test() for i in range(100000)]
print np.mean(results)
Does this depend on the underlying distribution?
Let's try it with a uniform distribution.
In [10]:
def test():
data = rng.uniform(size=3)
if np.min(data) > 0.5 or np.max(data) < 0.5:
return 1
else:
return 0
results = [test() for i in range(100000)]
print np.mean(results)
Doesn't seem to depend on the distribution, which is good (because in a real situation, we don't know what the underlying distribution is!)
Why doesn't it depend on the distribution? And how does it change with the number of samples?
In [18]:
def test_inside_bounds(n_samples):
data = rng.normal(size=n_samples)
if np.min(data) > 0 or np.max(data) < 0:
return 1
else:
return 0
Ss = [2, 3, 4, 5, 6, 7, 8, 9, 10]
rs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean(results))
print S, rs[-1]
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.show()
This is for the incredibly conservative scenario where we just take the largest and smallest sample values and only conclude that the actual mean is somewhere in between those values.
Can you see why this decreases by around 50% each time?
This method only fails if the samples happen to all be above the mean or all be below the mean. What are the chances of that happening? (This also explains why it doesn't depend on the distribution!)
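To make that concrete, here is a quick back-of-the-envelope check (not in the original notebook). For distributions where the mean and median coincide, as in the normal and uniform cases above, each sample independently lands above the mean with probability 1/2, so the chance that all $N$ samples land on the same side is $2 \cdot (1/2)^N$, which halves with every extra sample, matching the curve above.

# Analytic probability that all N samples fall on the same side of the mean,
# assuming the mean and the median coincide (true for the cases tested above).
for N in range(2, 11):
    print('%2d samples: %.4f' % (N, 2 * 0.5 ** N))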
However, this also produces huge ranges, since the more samples you have the wider the bounds will be.
In [33]:
def test_inside_bounds(n_samples):
data = rng.normal(size=n_samples)
ci = np.min(data), np.max(data)
if np.min(data) > 0 or np.max(data) < 0:
return 1, ci
else:
return 0, ci
Ss = [2, 3, 4, 5, 6, 7, 8, 9, 10]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
What could we do instead?
In your stats class, they dealt with this problem by computing a "confidence interval".
A 95% confidence interval says: "if you assume the actual value of whatever distribution you are sampling from is inside this range, then you will only be wrong 5% of the time".
Note that this does not tell you whether you're wrong in this particular case. There's no way to know that. But on average, this should give you a range that has the right answer in it 95% of the time.
If the assumptions hold. This is a big IF.
The main one they covered in your stats class was: IF the data is normally distributed, then the confidence interval is $\bar x \pm {s \over \sqrt N}t_{N-1,\,95\%}$, where $\bar x$ is the sample mean, $s$ is the sample standard deviation, $N$ is the number of samples, and $t_{N-1,\,95\%}$ is the 95% critical value of the Student's t-distribution with $N-1$ degrees of freedom.
Here it is, computed for the samples we started with:
In [20]:
import scipy.stats

def normal_ci(data, p=0.95):
    # t-based confidence interval: mean +/- (s / sqrt(N)) * t_{N-1, p}
    # (note that np.std defaults to ddof=0, the population standard deviation)
    t = np.array(scipy.stats.t.interval(p, len(data)-1))
    ci = np.mean(data) + np.std(data)/np.sqrt(len(data))*t
    return ci
print data
print normal_ci(data)
So this says that the mean is somewhere between -0.34 and 2.44, which is correct: the actual mean, 0, is inside that range.
What happens with more samples?
In [39]:
def test_inside_bounds(n_samples):
data = rng.normal(size=n_samples)
ci = normal_ci(data)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [2, 3, 4, 5, 6, 7, 8, 9, 10]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
Okay, that's better than before, but it's still a bit problematic.
Why is it wrong more than 5% of the time?
We're fine on the assumption that it's normally distributed data...
But there's another assumption: that you have enough samples. What is enough?
In [38]:
def test_inside_bounds(n_samples):
data = rng.normal(size=n_samples)
ci = normal_ci(data)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('Normal distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
General rule of thumb: Need about 20 samples to do a 95% confidence interval
Why? What would the rule of thumb be for an 80% confidence interval?
In [44]:
def needed_samples(p=0.95):
return 1.0 / (1-p)
print 0.95, needed_samples(p=0.95)
print 0.8, needed_samples(p=0.8)
The idea here is that you should have enough samples that you would expect at least one of them to fall outside the range. If none of them do, it's very hard to figure out where the edges of the range should be.
Uniform distribution
In [47]:
def test_inside_bounds(n_samples):
data = rng.uniform(-1,1, size=n_samples)
ci = normal_ci(data)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('Uniform distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
In [48]:
def test_inside_bounds(n_samples):
data = rng.binomial(1, p=0.5, size=n_samples)-0.5
ci = normal_ci(data)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('1-sample binomial distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
In [49]:
def test_inside_bounds(n_samples):
data = rng.gamma(0.1, size=n_samples)-0.1
ci = normal_ci(data)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('gamma distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
It's not too bad for other distributions, unless they're really pathological. Still, it doesn't get the expected 5% error unless all the assumptions hold: normally distributed data and enough samples (~50 to be safe).
Oh, and one other assumption: it only works for computing the mean.
What if I want a confidence interval on my standard deviation? Or on the median? Or on the regression coefficient? Or something more complicated than that?
There are analytical results for a few of these, but we'd really like something that works for whatever statistic we want to compute. This is especially the case when we're looking at human data and it's the distribution of performance that matters, not just the mean.
The term "bootstrap" refers to a whole family of algorithms that involve re-sampling from the samples you already have to approximate the underlying distribution.
The core idea for a bootstrap confidence interval is to simulate the process of running your experiment an infinite number of times. However, we can't actually re-run the experiment. Instead, we take your sample data and pretend it is the exact distribution of results, so we can simulate re-running the experiment by sampling from the results we already have.
In other words, if I have 20 data points, I can estimate what would happen if I re-ran the full experiment once by sampling 20 values from those 20 data points. This, of course, has to be sampling with replacement; otherwise I'd just get the original data back again. I can then compute the same measure (say, the mean) on that simulated run.
Just doing it once doesn't tell me what would happen if I did it an infinite number of times. So let's do it over and over again. The general number is 3000 (which is pretty close to infinity). Now you have 3000 computations of the mean. To determine your range, sort them, throw out the bottom 2.5% and the top 2.5%, and you're left with the 95% confidence interval.
In [51]:
def bootstrap_ci(data, func, n=3000, p=0.95):
    # Percentile bootstrap: resample the data with replacement n times,
    # compute func on each resample, and take the lower and upper (1-p)/2
    # percentiles of those values as the confidence interval.
    index = int(n*(1-p)/2)
    samples = np.random.choice(data, size=(n, len(data)))
    try:
        r = func(samples, axis=1)       # if the function supports axis
    except TypeError:
        r = [func(s) for s in samples]  # otherwise do it the slow way
    r.sort()
    return r[index], r[-index]
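Because the statistic is passed in as a function, the same machinery gives an interval on the median, the standard deviation, or anything else you can compute from a sample. A quick illustration (the sample variable here is just a fresh batch of normal draws, purely for demonstration):

# Swapping np.mean for np.median or np.std gives a confidence interval
# on that statistic instead, with no other changes needed.
sample = rng.normal(size=20)
print('mean CI:   %s' % (bootstrap_ci(sample, np.mean),))
print('median CI: %s' % (bootstrap_ci(sample, np.median),))
print('std CI:    %s' % (bootstrap_ci(sample, np.std),))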
In [54]:
def test_inside_bounds(n_samples):
data = rng.normal(size=n_samples)
ci = bootstrap_ci(data, np.mean)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(1000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('Normal distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
In [56]:
def test_inside_bounds(n_samples):
data = rng.binomial(1, p=0.5, size=n_samples)-0.5
ci = bootstrap_ci(data, np.mean)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('1-sample binomial distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
In [57]:
def test_inside_bounds(n_samples):
data = rng.gamma(0.1, size=n_samples)-0.1
ci = bootstrap_ci(data, np.mean)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(10000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('gamma distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
The bootstrap CI is worse for normally distributed data (unsurprising, since that's exactly the assumption the normal approach makes), but better for other distributions.
Also, notice that for the gamma distribution, the confidence interval is non-symmetric! This highlights another problem with the standard approach: it always gives symmetric intervals. The bootstrap does not.
However, also note that all these methods are having a hard time with the gamma distribution, and are giving errors much more than 5% of the time. Horrible distributions are horrible.
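One quick way to see the asymmetry for yourself (this check is not in the original notebook) is to compare the two kinds of interval on a single skewed sample:

# Draw one skewed gamma sample and compare the two intervals.
# The t-based interval is symmetric about the sample mean by construction;
# the percentile bootstrap interval is free not to be.
skewed = rng.gamma(0.1, size=40) - 0.1
print('sample mean:  %s' % np.mean(skewed))
print('t-based CI:   %s' % (normal_ci(skewed),))
print('bootstrap CI: %s' % (bootstrap_ci(skewed, np.mean),))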
There are a few common variants of the bootstrap CI. The one discussed above is the default, sometimes referred to as the "percentile interval" method.
Another common one is $BC_a$, the bias-corrected and accelerated bootstrap confidence interval. It attempts to cancel out the sampling bias in the bootstrap, using a trick similar to replacing the z-distribution with the Student's t-distribution in normal statistics.
Implementing it is a pain, so I'm using someone else's implementation here, taken from
https://github.com/cgevans/scikits-bootstrap
It is slower (although that might just be this implementation) and does give slightly better results, but I don't find it makes that much of a difference. Still, lots of people recommend it, so I'm tempted to start using it when presenting results in papers.
In [67]:
import scikits.bootstrap
def test_inside_bounds(n_samples):
data = rng.normal(size=n_samples)
ci = scikits.bootstrap.ci(data, np.average, method='bca', n_samples=3000)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(1000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('Normal distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
Another variant is ABC, or Approximate Bootstrap Confidence. This attempts to approximate what bootstrapping would do without actually bootstrapping. I have no idea how it works or what assumptions it makes, but it can be handy for getting quick estimates. Again, I'm using the implementation from https://github.com/cgevans/scikits-bootstrap, which doesn't seem to be as fast as it could be, so I'm not seeing much speedup over the standard bootstrap, but that could probably be improved.
In [66]:
import scikits.bootstrap
def test_inside_bounds(n_samples):
data = rng.normal(size=n_samples)
ci = scikits.bootstrap.ci(data, np.average, method='abc', n_samples=3000)
if ci[0] > 0 or ci[1] < 0:
return 1, ci
else:
return 0, ci
Ss = [5, 10, 15, 20, 25, 30, 35, 40]
rs = []
mins = []
maxs = []
for S in Ss:
results = [test_inside_bounds(n_samples=S) for i in range(1000)]
rs.append(np.mean([r[0] for r in results]))
mins.append(np.mean([r[1][0] for r in results]))
maxs.append(np.mean([r[1][1] for r in results]))
print S, rs[-1]
pylab.figure()
pylab.title('Normal distribution')
pylab.plot(Ss, rs)
pylab.xlabel('number of samples')
pylab.ylabel('probability of error')
pylab.figure()
pylab.fill_between(Ss, mins, maxs, color='#888888')
pylab.xlabel('number of samples')
pylab.ylabel('confidence interval')
pylab.show()
If something is "statistically significant", this means that there's some confidence interval that does not include some special number. For example, if I say group A is better than group B (p<0.05), this means that the 95% confidence interval of the difference between the means does not include zero.
We can compute this by just doing a bootstrap CI of np.mean(A)-np.mean(B). Notice that to do this, we resample A, resample B, compute the two means, find the difference, and repeat that process 3000 times.
(Indeed, this is exactly the comparison a t-test makes, except that the t-test uses the confidence interval calculation that assumes everything is normally distributed.)
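Here is a minimal sketch of that difference-of-means bootstrap, written in the same percentile style as bootstrap_ci above. The function name bootstrap_diff_ci and the groups A and B are mine, made up purely for illustration:

def bootstrap_diff_ci(a, b, n=3000, p=0.95):
    # Resample each group independently (with replacement), take the
    # difference of the resampled means, and read off the percentiles.
    index = int(n*(1-p)/2)
    diffs = []
    for _ in range(n):
        a_resampled = np.random.choice(a, size=len(a))
        b_resampled = np.random.choice(b, size=len(b))
        diffs.append(np.mean(a_resampled) - np.mean(b_resampled))
    diffs.sort()
    return diffs[index], diffs[-index]

# made-up example data for two groups
A = rng.normal(loc=1.0, size=20)
B = rng.normal(loc=0.0, size=20)
print('95%% CI on the difference of means: %s' % (bootstrap_diff_ci(A, B),))

If this interval excludes zero, the difference between the groups would be called "statistically significant" at p<0.05.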
However, it is always more useful to actually report the confidence interval itself. That is, instead of saying only that group A is better than group B (p<0.05), report the confidence interval on the difference between the means. That lets the reader tell the difference between a large effect with a tight interval and a tiny effect whose interval only barely excludes zero (even though the latter is also "statistically significant").
Science rant: so much of science would be so improved if we always reported this instead of p-values!